ABSTRACT

A density-based hierarchical group-finding algorithm is used to identify stellar halo structures in a catalog of
M-giants from the Two Micron All Sky Survey (2MASS). The intrinsic brightness of M-giant stars means that 
this catalog probes deep into the halo where substructures are expected to be abundant and easy to detect. Our 
analysis reveals 16 structures at high Galactic latitude (greater than 15°), of which 10 have been previously 
identified. Among the six new structures two could plausibly be due to masks applied to the data, one is asso- 
ciated with a strong extinction region and one is probably a part of the Monoceros ring. Another one originates 
at low latitudes, suggesting some contamination from disk stars, but also shows protrusions extending to high 
latitudes, implying that it could be a real feature in the stellar halo. The last remaining structure is free from the 
defects discussed above and hence is very likely a satellite remnant. Although the extinction in the direction of 
the structure is very low, the structure does match a low temperature feature in the dust maps. While this casts 
some doubt on its origin, the low temperature feature could plausibly be due to real dust in the structure itself. 
The angular position and distance of this structure encompass the Pisces overdensity traced by RR Lyraes in 
Stripe 82 of the Sloan Digital Sky Survey (SDSS). However, the 2MASS M-giants indicate that the structure 
is much more extended than what is visible with the SDSS, with the point of peak density lying just outside 
Stripe 82. The morphology of the structure is more like a cloud than a stream and reminiscent of that seen in 
simulations of satellites disrupting along highly eccentric orbits. This finding is consistent with expectations of 
structure formation within the currently favored cosmological model: assuming the cosmologically-predicted 
satellite orbit distributions are correct, prior work indicates that such clouds should be the dominant debris 
structures at large Galactocentric radii (~ 100 kpc and beyond). 

Subject headings: galaxies: halos - galaxies: structure - methods: data analysis - methods: numerical



1. INTRODUCTION 

Under the currently favored ΛCDM model of galaxy formation, the stellar halo is thought to have been
built up, at least in part, hierarchically through mergers of smaller satellite systems. Signatures
of these mergers should be apparent as structures in the stellar halo (Johnston 1998; Helmi & White
1999; Bullock et al. 2001; Johnston et al. 2008). In recent years observations have lent support to
the hierarchical picture with the discovery of a number of streams and structures of stars in the
stellar halo of the Milky Way. The most prominent of these structures are the tidal tails of the
Sagittarius dwarf galaxy (Ibata et al. 1994, 1995; Majewski et al. 2003), the Virgo overdensity
(Juric et al. 2008), the Triangulum-Andromeda structure (Rocha-Pinto et al. 2004; Majewski et al.
2004; Martin et al. 2007) and the low latitude Monoceros ring (Newberg et al. 2002; Yanny et al.
2003; Penarrubia et al. 2005; Martin et al. 2005).



The mapping of these low surface brightness structures can be attributed to the advent of large
scale stellar catalogs derived from surveys such as the Two Micron All Sky Survey (2MASS) and the
Sloan Digital Sky Survey (SDSS). Typically, a judicious color selection is applied to objects in a
survey in order to maximize the presence of stars with some well-defined absolute magnitude range.
Structures are then identified by visually inspecting sky projections of the stellar density in
slices of apparent magnitude. Future surveys, such as GAIA (Perryman 2002), LSST (Ivezic et al.
2009), SkyMapper (Keller et al. 2007) and PanSTARRS, will explore the stellar halo to greater depth,
with even larger numbers of stars and in more dimensions, and should be sensitive to even more
structures.

While discovery by visual inspection has proved successful so far, the scale and sophistication of
the maps generated from these data sets (both current and future) motivate an exploration of methods
that can instead identify structures objectively. This task is well suited to clustering algorithms,
which have enjoyed great success in other areas of astronomy, e.g., identifying galaxy groups in
redshift surveys (Eke et al. 2004) or identifying halos in cosmological simulations (Reed et al.
2007; Jenkins et al. 2001; Lacey & Cole 1994). The stellar halo presents unique challenges for such
algorithms. The structures in the stellar halo have arbitrary shapes, they span a wide range of
densities that cannot be separated by a single isodensity contour and they can have nested
substructures. In this paper we present an objective analysis of substructures in the stellar halo
using the code EnLink (Sharma & Johnston 2009), which is a density-based hierarchical group finder.
The code is ideally suited for this application for four reasons. First, a density-based
group-finder is able to identify irregular groups. Second, EnLink's clustering scheme can identify
groups at all density levels. Third, EnLink's organizational scheme allows the detection of the
full hierarchy of structures. Finally, the group finder gives an estimate of the significance of the
groups, so spurious clusters can be ignored.

Among the existing surveys, the 2MASS catalog of M-giant stars and the SDSS catalog of F and G type
main-sequence-turnoff (MSTO) stars provide the clearest global views of the stellar halo. While SDSS
contains a larger number of stars than the 2MASS M-giant sample, it covers only about 10,000 deg$^2$
(1/4 of the sky) in area. Moreover, the magnitude limit of SDSS means that MSTO stars can probe the
stellar halo only out to 35 kpc (Bell et al. 2008) while M-giant stars in 2MASS probe out to 100 kpc
(Majewski et al. 2003). This implies that the M-giant stars in 2MASS not only cover a factor of
about 90 more in volume than the MSTO stars in SDSS, but also probe the outer halo where the
substructures are expected to be more abundant and have higher density contrast. Hence, we choose to
apply EnLink to the 2MASS M-giant sample with the aim of objectively identifying substructures
within it.
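The quoted factor of about 90 can be checked with a few lines of arithmetic. The sketch below assumes that the surveyed volume scales as sky coverage times the cube of the limiting distance, and that the M-giant sample covers roughly four times the SDSS area; the variable names are illustrative only.

```python
# Rough check of the volume ratio, using the numbers quoted in the text.
area_ratio = 4.0               # all-sky coverage vs. about 1/4 of the sky for SDSS
depth_ratio = 100.0 / 35.0     # M-giant reach vs. MSTO reach, both in kpc

# Volume of a survey cone scales as (solid angle) x (limiting distance)^3.
volume_ratio = area_ratio * depth_ratio ** 3
print(f"volume ratio ~ {volume_ratio:.0f}")
```

Since the low-latitude cuts and extinction masks remove part of the sky from the M-giant sample, this should be read as an upper-end estimate.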

Note that using M-giants as tracers also has its share of disadvantages. First, M-giants are a rare
population, so the total size of the survey is much smaller than the SDSS MSTO sample. Second,
M-giants are metal rich, intermediate-age stars with metallicity [Fe/H] typically greater than
$-1.5$. Hence, applying a group finder to an M-giant survey will preferentially detect high
metallicity debris from the few massive recently-accreted objects and will be insensitive to ancient
or low-metallicity debris that originates from the many more low mass progenitors. The advantage of
this bias against ancient or low-metallicity stars is that it will increase the sensitivity to the
rare, recent, high-mass events. However, building a census of debris from all types of accreting
objects would require combining these results with those from other surveys, which will be
discussed in detail in a forthcoming paper (S. Sharma et al. 2010, in preparation).

The paper is organized as follows: Section 2 describes the 2MASS M-giant data set used in the paper;
Section 3 discusses the methods employed for analyzing the data, i.e., group finding; in Section 4
we describe the structures identified by the group-finder in the 2MASS M-giant sample; and finally,
we summarize our findings in Section 5.

2. SELECTING M-GIANT HALO STARS FROM THE 2MASS DATA 

The 2MASS all sky point source catalog contains about 471 million objects (the majority of which are
stars) with precise astrometric positions on the sky and photometry in three bands $J$, $H$, and
$K_s$. The survey catalog is 99% complete for $K_s < 14.3$. An initial sample of candidate M-giants
was generated by applying the selection criteria:

$$K_s < 14.0 \tag{1}$$

$$J - K_s > 0.85 \tag{2}$$

$$J - H < 0.561\,(J - K_s) + 0.36 \tag{3}$$

$$J - H > 0.561\,(J - K_s) + 0.19. \tag{4}$$

All magnitudes in the above equations are in the intrinsic, dereddened 2MASS system (labeled with
subscript 0 hereafter), with dereddening applied using the Schlegel et al. (1998) extinction maps.
These selection criteria and the dereddening method are similar to those used by Majewski et al.
(2003) to identify the tidal tails of the Sagittarius dwarf galaxy. In general, for
$(J - K_s)_0 > 0.85$ giants begin to separate from dwarfs in the near-infrared color-color diagram,
with redder colors leading to better discrimination. However, the number density of giants in the
catalog falls off rapidly as a function of color. As a compromise between quality (i.e., the level
of contamination by disk dwarfs) and quantity we restrict our search to stars with
$(J - K_s)_0 > 0.97$. This generates a list of about 450,000 stars spanning a magnitude range of
4.12 - 14.0 in the $(K_s)_0$ band.

Fig. 1. — Latitude vs. longitude scatter plot of M-giant stars identified in the 2MASS data. Top:
original data containing extinction regions at low latitudes. Bottom: distribution of stars after
masking the extinction regions by means of rectangular patches and retaining stars with latitude
b > 15°.

Since we are interested in the stellar halo, we further refine our selection with geometrical
factors aimed at reducing contamination by foreground disk stars, as well as adopting masks to cover
regions of high dust extinction. First, we impose the twin requirements that $(K_s)_0 > 10$ and
$(K_s)_0 \sin(b) > 14.0 \sin(15^\circ)$. The former condition gets rid of stars near the Sun, while
the latter limits the contribution by stars that are further away, but lie close to the Galactic
plane. At low latitudes the distribution of stars is not contiguous owing to the presence of
extinction clouds, which in some regions extend to a latitude of 30°. To avoid identifying spurious
structures and at the same time retain as much low latitude data as possible we mask the high
extinction regions by means of a set of rectangles in $(l, b)$ space, as shown in Figure 1. Finally,
there are some extinction holes in the region of the Large Magellanic Cloud (LMC). We fill these up
by identifying the stars lying within a region defined by
$\sqrt{(l - 280^\circ.0)^2 + (b + 33^\circ.0)^2} < 10^\circ$ and adding a dispersion of 1° to their
original latitude and longitude coordinates, as illustrated in the left and right panels of
Figure 2. After applying all of the selection criteria, the final sample contains 59,392 stars. An
Aitoff plot of these M-giants is shown in Figure 3.
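The photometric and geometric cuts of this section can be summarized as a single predicate. The following is a minimal sketch, not the code used for the analysis: the function name `is_halo_mgiant` is hypothetical, the rectangular extinction masks and the LMC filling step are omitted, and the use of the absolute latitude is an assumption made here so that both Galactic hemispheres are retained.

```python
import math

def is_halo_mgiant(J0, H0, K0, b_deg):
    """Return True if a star passes the cuts described in Section 2.

    J0, H0, K0 are dereddened 2MASS magnitudes; b_deg is the Galactic
    latitude in degrees. Extinction masks are not applied here.
    """
    jk = J0 - K0
    jh = J0 - H0

    # Equations (1), (3) and (4), with the tighter color limit
    # (J - Ks)_0 > 0.97 that the text adopts in place of Equation (2).
    color_ok = (
        K0 < 14.0
        and jk > 0.97
        and 0.561 * jk + 0.19 < jh < 0.561 * jk + 0.36
    )

    # Geometric cuts against disk contamination.
    # ASSUMPTION: |b| is used so that both Galactic hemispheres are kept.
    sin_b = math.sin(math.radians(abs(b_deg)))
    geometry_ok = K0 > 10.0 and K0 * sin_b > 14.0 * math.sin(math.radians(15.0))

    return color_ok and geometry_ok
```

For instance, a star with $(K_s)_0 = 12$, $(J - K_s)_0 = 1.0$ and $(J - H)_0 = 0.85$ passes at a latitude of 60° but is rejected at 5°, where the second geometric condition fails.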

A particularly useful property of M-giants is that their abso- 
lute magnitude varies approximately linearly with their color 
and can be expressed as 

$$M_{K_s} = A + B\,(J - K_s) \tag{5}$$

A slope of $B = -9.42$ was found to be a good fit, in the regime $0.97 < (J - K_s)_0 < 1.2$, to a
range of theoretical isochrones with [Fe/H] $> -1$ and ages in the range 6-13 Gyr. The intercept
$A$, however, depends upon the age and metallicity. Since we do not know the age and metallicity we
choose to adopt a constant value of $A = 3.26$ that roughly bisects the distribution of $M_{K_s}$
versus $(J - K_s)_0$ in the simulated stellar halos of Bullock & Johnston (2005).$^1$ One such halo
is shown in Figure 4. The dashed lines with $A = 3.26 \pm 1.1$ represent the range of scatter about
this relationship. A detailed discussion of the impact of our assumption of a constant age and
metallicity for detecting structures is given in Section 3.3.

Fig. 2. — Latitude vs. longitude scatter plot of M-giant stars in the LMC region. Left: original
data showing extinction regions. Right: the same region after adding a dispersion of 1° to stars
satisfying $\sqrt{(l - 280^\circ.0)^2 + (b + 33^\circ.0)^2} < 10^\circ$.

Fig. 3. — An Aitoff plot in galactic coordinates of the final 2MASS M-giant catalog that is used
for the group-finding analysis.
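To make the use of the color-magnitude relation concrete, the sketch below converts a dereddened magnitude and color into a distance. It combines Equation (5), with the adopted $A = 3.26$ and $B = -9.42$, with the standard distance-modulus relation; the function name `mgiant_distance_kpc` is hypothetical and is not taken from the paper's code.

```python
import math

A = 3.26     # adopted intercept of Equation (5)
B = -9.42    # adopted slope of Equation (5)

def mgiant_distance_kpc(K0, JK0, A=A, B=B):
    """Photometric distance from the linear color-magnitude relation."""
    M_K = A + B * JK0                          # Equation (5)
    mu = K0 - M_K                              # distance modulus
    return 10.0 ** (mu / 5.0 + 1.0) / 1.0e3    # parsec converted to kpc
```

A star with $(K_s)_0 = 12$ and $(J - K_s)_0 = 1.0$ has $M_{K_s} = -6.16$ under this relation and comes out at roughly 43 kpc.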

3. METHODS 

3.1. Group finding 

In this paper we use the density-based hierarchical group-finder EnLink (described in detail in
Sharma & Johnston 2009) that can cluster a set of data points defined over an arbitrary space. For
our application the stars are treated as the data points and the coordinates of the data points are
defined by the position of the stars in three-dimensional space. The group finding scheme of EnLink
is similar to ISODEN (Pfitzner et al. 1997) and SUBFIND (Springel et al. 2001) and is based on the
fact that a system having more than one group will have peaks and valleys in the density
distribution, the peaks being formed at the center of the groups and the valleys or saddle points
where they overlap. The peaks are identified as groups and the region around each peak, which is
bounded by an isodensity contour corresponding to the density at the valley, is associated with the
group. This is shown schematically in Figure 5 for a one-dimensional case. The valleys also define
connections between groups and these are used to assign a parent/child relationship between the
groups, resulting in a hierarchy of clusters.

1 The simulated halos were converted into a synthetic catalog of stars by utilizing isochrones from
the Padova group (Bertelli et al. 1994; Marigo et al. 2008; Bonatto et al. 2004). A code was
developed for this, details of which will be presented in a forthcoming paper (S. Sharma et al.
2010, in preparation).

Fig. 4. — Absolute magnitude of M-giants as a function of their color in a $K_s < 14$ volume
limited sample of the Bullock & Johnston (2005) simulated stellar halo (halo-2). The relationship is
well represented by a function of the form $M_{K_s} = A - 9.42\,(J - K_s)$. The solid line with a
value of $A = 3.26$ is found to roughly bisect the distribution of points in the plot. The dashed
lines with $A = 3.26 \pm 1.1$ represent the range of scatter about this relationship.

Fig. 5. — Schematic illustration of the group-finding scheme in one dimension. The plot shows the
distribution of density in space for three superposed Gaussian distributions along with a noise of
0.02 dex. The substructures that are bounded by a valley are represented by the thick light gray
(orange) curve. The maximum and minimum values of the density in a group are used to calculate its
significance.

To implement the above scheme EnLink first calculates the density using a nearest neighbor scheme,
where the number of nearest neighbors $k_{\rm den}$ is fixed and is supplied by the user. A list of
$k_{\rm link} = 10$ nearest neighbors for each data point is also computed and stored. Next, the
points are sorted according to their density in descending order and stored in a list. Starting from
the densest, each point from this sorted list is chosen successively and acted upon according to
three options:

(i) If the point does not have a neighbor denser than itself 
a cluster is created and the particle is added to it; 

(ii) If its denser neighbors belong to a unique cluster the 
particle is also added to it; 

(iii) If the denser neighbors belong to different clusters, 
the two nearest clusters are selected and the point is 
added to the cluster having the closest neighbor. Also, 
the smaller of the two nearest clusters becomes a sub- 
cluster of the larger (now the "parent" cluster) and all 
future particles that need to be added to the smaller 
cluster are added to the parent from then on — a process 
known as sub-cluster attachment. 
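The three options above can be sketched in a few lines of Python. This is a simplified illustration written for this text, not the EnLink implementation: the function `link_groups` and its bookkeeping lists are hypothetical, the densities and neighbor lists are assumed to be precomputed, group size is taken to include attached sub-clusters (an assumption), and the significance-based pruning described next is left out.

```python
def link_groups(density, neighbors):
    """Density-ordered linking of points into a hierarchy of groups.

    density   : list of density estimates, one per point
    neighbors : neighbors[i] lists the nearest neighbors of point i,
                closest first
    Returns (label, parent): label[i] is the group of point i and
    parent[g] is the parent group of g (None for a root group).
    """
    order = sorted(range(len(density)), key=lambda i: -density[i])
    label = [None] * len(density)
    parent = []      # parent[g]: parent group of g
    size = []        # size[g]: points in g, including attached sub-clusters
    redirect = []    # redirect[g]: group that now receives new points of g

    def resolve(g):
        # Follow sub-cluster attachment to the group that accepts points.
        while redirect[g] != g:
            g = redirect[g]
        return g

    for i in order:
        # Neighbors that already carry a label were processed earlier,
        # so they are the denser ones.
        denser = [j for j in neighbors[i] if label[j] is not None]
        if not denser:
            # Option (i): a density peak starts a new group.
            g = len(parent)
            parent.append(None)
            size.append(1)
            redirect.append(g)
            label[i] = g
            continue

        groups = []
        for j in denser:                     # closest neighbor first
            g = resolve(label[j])
            if g not in groups:
                groups.append(g)

        # Options (ii) and (iii): join the group of the closest denser neighbor.
        target = groups[0]
        label[i] = target
        size[target] += 1

        if len(groups) > 1:
            # Option (iii): the point is a saddle between the two nearest
            # groups; the smaller one becomes a sub-cluster of the larger.
            a, b = groups[0], groups[1]
            small, large = (a, b) if size[a] < size[b] else (b, a)
            parent[small] = large
            redirect[small] = large
            size[large] += size[small]

    return label, parent
```

On a one-dimensional toy profile with two peaks separated by a valley, the point in the valley triggers option (iii) and the smaller peak is recorded as a child of the larger one.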

EnLink employs an additional strategy to screen out spurious groups that can arise due to Poisson
noise in the data. EnLink defines the significance $S$ of a group as the ratio of the signal
associated with the group to the noise in the measurement of this signal (see Figure 5 for a
schematic illustration). The contrast $\ln(\rho_{\max}) - \ln(\rho_{\min})$ between the peak
density of a group ($\rho_{\max}$) and the valley ($\rho_{\min}$) where it overlaps with another
group can be thought of as the signal, and the noise in this signal is given by the standard
deviation $\sigma_{\ln\rho}$ associated with the density estimator. Combining the definitions of
signal and noise then leads to

$$S = \frac{\ln(\rho_{\max}) - \ln(\rho_{\min})}{\sigma_{\ln\rho}} \tag{6}$$

For Poisson-sampled data the distribution of density as estimated by the code using the kernel
scheme is log-normal and the standard deviation satisfies the relation
$\sigma_{\ln\rho} = \sqrt{V_d \|W\|_2^2 / k_{\rm den}}$, where $k_{\rm den}$ is the number of
neighbors employed for density estimation, $V_d$ the volume of a $d$-dimensional unit hypersphere
and $\|W\|_2$ the $L_2$ norm of the kernel function (Sharma & Johnston 2009). For our case, $d = 3$
and $k_{\rm den} = 30$, and the standard deviation is $\sigma_{\ln\rho} = 0.22$.
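As a numerical illustration, the sketch below evaluates the noise term in the form $\sigma_{\ln\rho} = \sqrt{V_d \|W\|_2^2 / k_{\rm den}}$ and the significance of Equation (6). The kernel is not specified in this section, so the default squared norm, $15/(14\pi)$, is that of a three-dimensional Epanechnikov kernel and is purely an assumption; it does happen to give a value close to the quoted 0.22 for $d = 3$ and $k_{\rm den} = 30$. Both function names are hypothetical.

```python
import math

def sigma_ln_rho(k_den, d=3, kernel_l2_sq=15.0 / (14.0 * math.pi)):
    """Dispersion of ln(density) for Poisson-sampled data.

    kernel_l2_sq is the squared L2 norm of the smoothing kernel.
    ASSUMPTION: the default is the value for a 3-D Epanechnikov kernel.
    """
    # Volume of the d-dimensional unit hypersphere.
    v_d = math.pi ** (d / 2.0) / math.gamma(d / 2.0 + 1.0)
    return math.sqrt(v_d * kernel_l2_sq / k_den)

def significance(rho_max, rho_min, sigma):
    """Equation (6): logarithmic density contrast in units of the noise."""
    return (math.log(rho_max) - math.log(rho_min)) / sigma
```

With a noise of 0.22, a group needs a peak-to-valley density ratio of about $e^{0.22 S}$ to reach significance $S$, i.e. a little more than a factor of two for $S \approx 3.75$.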

The distribution of the significance parameter $S$ is close to a Gaussian function for
Poisson-sampled data. This implies that spurious groups in general have low $S$ and their
probability of occurrence falls off like a Gaussian distribution with increasing $S$. Hence,
selecting groups using a simple threshold in the significance, $S_{\rm Th}$, can get rid of the
spurious groups. EnLink uses this recipe to calculate the significance of the groups. All groups
with significance below $S_{\rm Th}$ are denied the status of a group and are merged with their
respective parent groups.

3.2. Parameter Choices 

The number and properties of groups recovered by our clustering algorithm depend in part on the
parameters adopted for the group finder itself, as well as how the data are transformed from
observable to real-space. In this section, we first define measures to evaluate the performance of
our clustering scheme (Section 3.2.1) and subsequently use these measures to guide our choice of
data transformation (Section 3.2.2) and group-finding parameters (Section 3.2.3).

3.2.1. Evaluation of clustering 

Let Q be a set of data points with two partitions $I$ and $J$, $I$ being the set of intrinsic
classes that are known a priori and $J$ being the set of groups or clusters found by the group
finder. In our case the data points are the stars in the halo and the intrinsic classes are the
individual satellite systems that make up the halo. Overlaps between the two partitions are given by
the contingency matrix $n_{ij}$, which gives the number of data points common to both class
$i \in I$ and group $j \in J$. The class that is most frequent
($\mathrm{argmax}_{i \in I}\, n_{ij}$) in a group is the class discovered by the group, and $D_i$
is the set of all groups in which class $i$ is discovered.

One measure of success for our group finder is the degree to 
which recovered groups represent intrinsic classes, which in 
our case correspond to real physical associations. We there- 
fore define purity as the fraction of correctly classified points 
in a group j : 

$$\mathrm{Purity}(j) = \frac{\max_i \{n_{ij}\}}{n_{.j}}, \tag{7}$$

where $n_{.j} = \sum_i n_{ij}$ is the total number of data points in that group. The mean value of
purity, $P = \sum_j \mathrm{Purity}(j)/|J|$, is then a good indicator of the overall quality of the
clustering.

We would also like to know how much of an intrinsic class 
can typically be recovered — in our case this corresponds to re- 
constructing long-dead satellites. In clustering algorithms, the 
fraction of correctly classified points in a class summed over 
all groups where the class is discovered is traditionally known 
as the recall of a class. We modify this definition slightly to 
also take into account the purity of the discovered points and 
define penalized recall as 

$$\mathrm{PRecall}(i) = \sum_{j \in D_i} \frac{n_{ij}}{n_{i.}}\,\bigl(\mathrm{Purity}(j) - 0.5\bigr) \times 2, \tag{8}$$

where $n_{i.} = \sum_j n_{ij}$ is the total number of data points in class $i$. The total value of
penalized recall, $\sum_i \mathrm{PRecall}(i)$, represents the mean number of classes discovered by
the group finder along with a penalty term for classes discovered with purity less than 0.5. This is
a good indicator of the overall amount of clustering.

While mean purity and total penalized recall are sensitive 
to different aspects of clustering, in many situations they vary 
inversely with each other and hence both of them should be 
taken into account when evaluating clustering success. We do 
this by defining a clustering performance index (CPI), which 
is given by 

$$\mathrm{CPI} = \sum_i \mathrm{PRecall}(i) \times \sum_j \frac{\mathrm{Purity}(j)}{|J|}. \tag{9}$$

The larger the value of CPI the better the clustering results. Typically the value of CPI ranges
between 0 and $|I|$. The maximum value occurs when both the mean purity and total recall have their
maximum values, which are 1 and $|I|$ respectively. In some extreme circumstances, e.g., when the
total recall is negative, CPI can be negative and the minimum possible value is $-|I|$.
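The three measures are straightforward to compute from a contingency matrix. The sketch below is an illustration under stated assumptions rather than the evaluation code used for the paper: the function name `clustering_performance` is hypothetical, every class and every group is assumed to be non-empty, and the penalized recall is taken in the form $\sum_{j \in D_i} (n_{ij}/n_{i.}) \times 2\,(\mathrm{Purity}(j) - 0.5)$, which reaches 1 for a perfectly recovered class.

```python
def clustering_performance(n):
    """Mean purity, total penalized recall and CPI.

    n[i][j] is the number of points shared by intrinsic class i and
    recovered group j (the contingency matrix).
    """
    n_classes, n_groups = len(n), len(n[0])
    group_size = [sum(n[i][j] for i in range(n_classes)) for j in range(n_groups)]
    class_size = [sum(row) for row in n]

    purity, discovered = [], []
    for j in range(n_groups):
        column = [n[i][j] for i in range(n_classes)]
        best = max(range(n_classes), key=lambda i: column[i])
        discovered.append(best)                        # class discovered by group j
        purity.append(column[best] / group_size[j])    # Equation (7)

    precall = [0.0] * n_classes
    for j, i in enumerate(discovered):                 # group j belongs to D_i
        precall[i] += n[i][j] / class_size[i] * 2.0 * (purity[j] - 0.5)   # Equation (8)

    mean_purity = sum(purity) / n_groups
    total_precall = sum(precall)
    return mean_purity, total_precall, total_precall * mean_purity        # Equation (9)
```

For two classes recovered perfectly by two groups the function returns a CPI of 2, the maximum $|I|$; if 20% of each class leaks into the other group, the mean purity drops to 0.8 and the CPI to about 0.77.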

3.2.2. Choice of coordinate system and metric 

The efficiency of detecting structures in a data set depends 
upon the choice of the coordinate system in which the data 
are described and the metric (a function of coordinates that 
defines the distance between any two points in a space) used 
to calculate distances. The simplest metric is the Euclidean 
metric — appropriate when all the dimensions are of the same 
physical units, such as the Cartesian coordinate system defined by
the $x, y, z$ position of stars in a three-dimensional
space. The observational data of stars, however, are in a spher- 
ical coordinate system given by the two angular positions on 
the sky and the radial distance. If the uncertainty associated 
with the coordinates is small, the data can be easily converted 
to the Cartesian system. More realistically, the angular coor- 
dinates can be directly measured with very high precision but 
the radial distance needs to be estimated indirectly from the 
properties of the stars and hence has a large uncertainty associated
with it. For example, as discussed in Majewski et al.
(2003) we expect a distance uncertainty of about 18% for
the M-giants in our sample, and in this case using the sim- 
ple Cartesian coordinate system could severely degrade the 
quality of clustering. 

A common solution in cases having large uncertainty in one 
of the coordinates is to perform a dimensionality reduction 
and analyze the data in a lower dimensional space — for ex- 
ample, in our case using angular positions alone. An alterna- 
tive to ignoring the radial dimension altogether is to redefine 
the radial coordinate in a logarithmic scale and then use this 
modified radial coordinate to convert the data to a Cartesian 
system. The advantage of this transformation lies in the fact 
that while the dispersion in radial distance r increases linearly 
with r, the dispersion in modified radial coordinate log(r) is 
constant. This motivates a transformation of our radial coordinate to
$r' = 5\log(r/10\,{\rm pc}) - \mu_0$, where $\mu_0$ is a constant that determines the degree to
which the radial dimension is ignored or used. If $\mu_0$ is small the data lie in a thin shell,
which is equivalent to ignoring the radial dimension altogether. On the other hand, if $\mu_0$ is
large the radial dimension is given more prominence.
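The transformation can be sketched as follows. The function `modified_cartesian` is a hypothetical helper: it assumes a base-10 logarithm (so that $5\log(r/10\,{\rm pc})$ is the usual distance modulus), distances measured from the observer, and a default $\mu_0 = 8$, the value adopted in this section.

```python
import math

def modified_cartesian(l_deg, b_deg, r_kpc, mu0=8.0):
    """Map (l, b, r) to Cartesian coordinates built on r' = mu - mu0."""
    mu = 5.0 * math.log10(r_kpc * 1.0e3 / 10.0)   # distance modulus, r in parsec
    r_mod = mu - mu0                              # modified radial coordinate r'
    l, b = math.radians(l_deg), math.radians(b_deg)
    return (
        r_mod * math.cos(b) * math.cos(l),
        r_mod * math.cos(b) * math.sin(l),
        r_mod * math.sin(b),
    )
```

A star at 10 kpc has a distance modulus of 15 and is therefore placed at $r' = 7$, while one at 100 kpc lands at $r' = 12$. An 18% distance error shifts $r'$ by about 0.36 at either distance, which is the constant dispersion the transformation is designed to provide.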

In order to demonstrate the effectiveness of our coordinate transformation we applied the
group-finder EnLink (with parameters $k_{\rm den} = 30$ and $S_{\rm Th} = 4.25$) to a synthetic
stellar halo survey generated from the simulations of Bullock & Johnston (2005). As a particularly
stringent test we chose to look at a stellar halo that had been constructed entirely from
low-luminosity satellites and hence contained numerous small, low-contrast structures rather than a
few large ones (corresponding to the "low-luminosity halo" amongst the six non-ΛCDM halo models
described in Johnston et al. 2008). A color limit of $0.1 < g - r < 0.3$ and a magnitude limit of
$M_r < 24.5$ (in the SDSS ugriz bands) were used to generate the model halo. Two samples were
generated from the model, one without and one with distance errors, referred to as data $T$ and
data $T_{\rm error}$ respectively. The group finder was run in both the normal coordinate system
and the modified coordinate system (with $\mu_0 = 8$). For data $T_{\rm error}$ we assumed a
distance uncertainty of $\sigma_r/r = 0.25$. To compare clustering we use two measures: the number
of detected groups $G$ and the clustering performance index CPI (defined in Section 3.2.1). The
results are tabulated in Table 1. It can be seen that for data without errors the clustering results
are similar in both the coordinate systems, but for data with errors, clustering is better in the
modified coordinate system as evidenced by the increase in both $G$ and CPI.

Next, to choose an appropriate value of $\mu_0$ we compared the clustering results, for the data
$T_{\rm error}$, in the modified coordinate system with different values of $\mu_0$. The CPI was
found to be maximum at $\mu_0 \sim 8$ and hence we adopt this value for the rest of our analysis.
It should be noted that the clustering results were not strongly sensitive to the exact choice of
$\mu_0$; in fact CPI was found to vary very little in the range $-10 < \mu_0 < 10$. In general,
decreasing $\mu_0$ was found to increase the number of detected groups. However, the mean purity of
groups was found to decrease along with $\mu_0$, so choosing a value of $\mu_0$ too small would mean
greater contamination by spurious groups.

3.2.3. Optimum choice of group-finding parameters 

The two free parameters in the group-finder are the number of neighbors employed for density
estimation, $k_{\rm den}$, and the significance threshold, $S_{\rm Th}$, of the groups. We select
$k_{\rm den} = 30$: a smaller value makes the results of the clustering algorithm sensitive to noise
in the data, while a larger value means that small structures go undetected.

The choice of the second free parameter, $S_{\rm Th}$, is governed by the desire to make the
expected number of spurious groups, which can arise due to Poisson noise in the data, either
constant or zero. This is important if one wants to reliably use the number of detected groups as a
measure of clustering strength. For $d$-dimensional data consisting of $N$ points an optimum value
of $S_{\rm Th}$ can be chosen by considering the number of spurious groups with significance greater
than $S_{\rm Th}$ expected for Poisson-distributed data (i.e., data points being distributed in a
finite region of space uniformly but randomly). The required expression is given by



$$G(> S_{\rm Th}) = \left(1 - \mathrm{erf}\left(S_{\rm Th}/\sqrt{2}\right)\right) \frac{15.5\,N}{d^{2.1}\, k_{\rm den}^{1.2}} \tag{10}$$



(Sharma & Johnston 2009). Since the presence of even one or two spurious groups can severely
contaminate the analysis of structures we calculate the optimum value of $S_{\rm Th}$ for a given
$N$ by setting the expected number of spurious groups $G(> S_{\rm Th}) = 0.5$ in Equation (10) and
solving for $S_{\rm Th}$. Using this method we find $S_{\rm Th} = 3.75$ for $N = 10^5$ (typical size
of the data analyzed in this paper). In general, increasing $S_{\rm Th}$ decreases the number of
recovered groups and the value of total recall, but increases the mean purity. On the other hand,
decreasing $S_{\rm Th}$ has exactly the opposite behavior. This suggests that CPI should be maximum
at some optimum value of $S_{\rm Th}$. In our tests on synthetic halos we do see this behavior,
i.e., for values of $S_{\rm Th}$ for which $G(> S_{\rm Th}) = 0.5$, CPI also tends to be maximum.
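The threshold can be found numerically with a simple bisection. The sketch below assumes that the expected number of spurious groups has the form $G(> S_{\rm Th}) = (1 - \mathrm{erf}(S_{\rm Th}/\sqrt{2}))\, 15.5\,N / (d^{2.1} k_{\rm den}^{1.2})$, which is a reading of the expression above and should be checked against Sharma & Johnston (2009); the function names are hypothetical.

```python
import math

def expected_spurious(s_th, n_points, d=3, k_den=30):
    """Expected number of spurious groups above s_th (assumed form of Eq. 10)."""
    tail = 1.0 - math.erf(s_th / math.sqrt(2.0))
    return tail * 15.5 * n_points / (d ** 2.1 * k_den ** 1.2)

def optimum_threshold(n_points, target=0.5, d=3, k_den=30):
    """Bisection for the s_th at which expected_spurious equals target."""
    lo, hi = 0.0, 10.0     # expected_spurious decreases monotonically with s_th
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if expected_spurious(mid, n_points, d, k_den) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For $N = 10^5$, $d = 3$ and $k_{\rm den} = 30$ this returns a threshold close to the value of 3.75 quoted in the text; because the error function falls off so steeply, the result changes only slowly with $N$.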

As a final confirmation of our choice of threshold $S_{\rm Th}$, we generated a data set that
contained only noise by replacing the latitude and longitude measured for 2MASS M-giant stars with
values selected at random from a uniform distribution over a sphere but excluding the low latitude
regions (as in the case of the real 2MASS M-giant sample). We then applied the group-finder to this
randomized data set with $S_{\rm Th} = 1$. The distribution of significance $S$ for the recovered
groups is shown as the dotted histogram in the top panel of Figure 6.

2 Note that the significance of a real group in a given data set also increases 
with an increase in N, primarily due to the improved spatial resolution and 
secondarily due to the nature of Poisson noise. 
Fig. 6. — Distribution of significance S for the groups identified by the
group-finder. Groups with S > 5 were assigned a value of S = 5. The plot 
shows the results for the 2MASS M-giant data and a randomized 2MASS M- 
giant data created by choosing the latitude and longitude at random so as to 
have a uniform distribution over a sphere. 

The groups have a distribution that is like the tail of a Gaussian, with very few groups having
$S > 3.75$. The distribution of groups recovered from the real 2MASS M-giant sample (solid line) is
similar to that for randomized data for $S < 3.75$. However, for $S > 3.75$ several extra groups can
be seen. This suggests that choosing a significance threshold of $S_{\rm Th} = 3.75$ to identify
groups in a survey containing $10^5$ points will minimize contamination by spurious groups.
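A randomized sky of this kind is easy to generate. The sketch below draws positions uniformly over the sphere while excluding a low-latitude band; the function name is hypothetical, and the simple $|b| > 15°$ exclusion is an assumption standing in for the magnitude-dependent latitude cut and the extinction masks applied to the real sample.

```python
import math
import random

def random_sky_positions(n, b_min_deg=15.0, seed=0):
    """Draw n (l, b) pairs in degrees, uniform over the sphere with |b| > b_min_deg."""
    rng = random.Random(seed)
    s_min = math.sin(math.radians(b_min_deg))
    positions = []
    for _ in range(n):
        l = rng.uniform(0.0, 360.0)
        # Equal areas on a sphere correspond to equal intervals in sin(b),
        # so sin(b) is drawn uniformly and given a random hemisphere.
        s = rng.uniform(s_min, 1.0) * rng.choice((-1.0, 1.0))
        positions.append((l, math.degrees(math.asin(s))))
    return positions
```

Drawing the latitude itself uniformly would instead crowd points toward the poles and create an artificial density gradient, which is exactly the kind of structure a noise-only test has to avoid.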

3.3. Impact of the assumption of a single age and metallicity 

In Section 2 we had tentatively assumed a value of $A = 3.26$ in the color magnitude relation
(represented by Equation (5), referred to as CMR hereafter), which corresponds to assuming a single
age and metallicity for all the stars. We now revisit this issue and study the impact of this
assumption for group-finding studies.

First, we note that as a consequence of working in a space of modified radial coordinate (see
Section 3.2.2), there is a complete degeneracy between the choice of parameters $\mu_0$ and $A$.
Since our analysis in Section 3.2.2 has already shown that group finding is insensitive to the exact
choice of $\mu_0$, the same applies for $A$. The relative insensitivity of clustering to $\mu_0$ or
$A$ is because a change in value of either of them leads to a mere translation of the data in the
radial direction while the geometry of structures within the data remains almost intact.

Although the ability to identify structures is not sensitive to the exact choice of $A$, it is
sensitive to the scatter of the stars about the adopted CMR (as shown in Figure 4). The standard
deviation of distance modulus, $\sigma_M$, computed using the adopted CMR for the full halo was
found to be 0.51. This high value of $\sigma_M$ is mostly due to systematic differences in
metallicities and ages between satellite systems rather than large ranges internal to each system.
These systematic differences simply translate the structures relative to each other in space, an
effect which does not significantly hamper how well they can be detected. For the purpose of
detecting structures what matters most is the $\sigma_M$ for individual satellite systems. Using the
simulated stellar halos of Bullock & Johnston (2005) we found the mean value of $\sigma_M$ for
individual satellite systems to be 0.34, i.e., distance uncertainty $\sigma_r/r = 0.15$, in
accordance with our expectation. These dispersion estimates are also in agreement with the results
of Majewski et al. (2003), where they report $\sigma_M = 0.36$ for the 2MASS M-giants in the core of
Sagittarius.
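The step from a scatter in magnitudes to a fractional distance error follows from differentiating the distance modulus. The sketch below uses this first-order relation, which is an approximation valid for small scatter; the function name is hypothetical.

```python
import math

def fractional_distance_error(sigma_mag):
    """First-order distance error implied by a scatter in distance modulus."""
    # mu = 5 log10(r) + const, so d(mu) = (5 / ln 10) * dr / r.
    return math.log(10.0) / 5.0 * sigma_mag
```

A scatter of 0.34 mag corresponds to a distance error of about 16%, consistent with the roughly 15% quoted above, while the 0.51 mag found for the full halo corresponds to about 23%.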

Our previous discussion suggests that using the 2MASS M-giants along with our adopted CMR for
distance determination should be roughly equivalent to using a data set with about 15% dispersion
in distance estimates. To test this we employ the same low luminosity halo that was used in
Section 3.2.2 but now generate a sample of M-giants using the color magnitude limits as described in
Section 2 for the real 2MASS M-giant data. Equation (10) was used to select the optimum
$S_{\rm Th}$ relevant for the present data size and the group finder was run once with 15% errors in
distance and once with distance computed using Equation (5) (data $T_{\rm M-giants}$). The results
are tabulated in Table 1. It can be seen that both data sets give nearly the same number of groups,
which demonstrates that for the purpose of detecting groups, the effect of using a constant age and
metallicity is similar to that of data with 15% error in radial distances.

Comparing the number of detected groups in Table 1 for
different data sets also allows us to compare the overall
group-finding efficiency of different schemes. We find that of the
groups that could have been detected without any distance
errors (data set T), only 30% are detected by data T_error and
15% by data T_M-giants. Although the drop in ability to detect
groups is quite dramatic, it is mainly a reflection of the fact
that fainter structures that are harder to detect are much more
numerous than the brighter and easily detectable structures.
Additionally, our results are biased by the fact that we use
a hypothetical halo dominated by low mass accretion events,
which are also the ones that are preferentially missed in data
with measurement errors. Hence, for a realistic ΛCDM halo
we expect the percentage of detected groups to be slightly
higher.

Next, we compare the number of detected groups for data
T_M-giants with data T_error. Although the distance error for
data T_M-giants is less than that for data T_error, the number of
detected groups is still about a factor of 2 lower. Three
factors could be responsible for this. First, the sample size for
T_M-giants is three orders of magnitude lower than that for data
T_error, which means that the data T_M-giants have lower spatial
resolution, and this makes identification of groups difficult.
Second, M-giant data are biased toward detecting high
metallicity, intermediate-age stars and would miss low metallicity
systems or those accreted long ago, which dominate by number.
Finally, the high mass systems, which are preferentially
sampled by M-giants due to their high metal content, are also
the most phase mixed ones and contribute more to the smooth
background, making structure detection even more difficult.
In fact, in a forthcoming paper (S. Sharma et al. 2010, in
preparation) we demonstrate that, despite the low number of
stars, a 2MASS type survey can recover most of the structures
that originate from high mass progenitors and are on orbits of
low eccentricity.

4. RESULTS: STRUCTURES TRACED BY M-GIANTS IN THE 2MASS 

Applying our group-finder with k_den = 30 and S_Th = 3.75
to the 2MASS M-giant sample reveals 16 groups. An
Aitoff plot of the groups is shown in Figure 7, where each
identified group is coded with a unique color and the filled circles
mark the position of the densest particle in the group. A
summary of the group properties is shown in Table 2. Listed in
the table are the names of the groups, the Galactic longitude and
latitude of the density peak in the groups, the number of
stars in the groups, the significance parameter of the groups,
the value of peak density and the radial distance of the groups.
The first 10 groups listed in the table can be associated with
known structures in the Local Group, while the other six are
new candidate structures.

Among the known structures that have been identified by 
Fig. 7. Groups found in the 2MASS M-giant sample shown in Aitoff projection maps centered at 0° longitude (upper panel) and 180° longitude (lower
panel). Large black filled circles mark the position of the density peak in a group. Stars in each group are color coded with a unique color and are shown as small
filled circles. The solid black lines mark the low latitude area that is excluded from the analysis. Note that group A7 lies on top of group A2.



TABLE 2

Summary of Groups Found in the 2MASS M-giant Sample.

Name  Description                      l         b         N_stars  Sig    ρ_peak      Distance^a (kpc)
A1    LMC                              282.865   -32.231   49234    52.9   2.7 × 10^4  60.1 ± 30
A2    SMC                              301.823   -43.925   4001     33.4   3.0 × 10^4  64.0 ± 32
A3    Sag leading arm, north           358.130   27.985    3245     27.5   6.5 × 10^1  63.1 ± 32
A4    Sag core                         5.51100   -20.053   1460     24.4   1.8 × 10^3  37.2 ± 19
A5    Sag trailing arm, south          157.190   -62.682   226      4.82   1.0 × 10^1  37.2 ± 19
A6    Andromeda                        120.819   -22.212   117      4.49   6.5         122.0 ± 61^b
A7    Group in SMC                     302.436   -43.837   83       5.13   1.7 × 10^4  48.6 ± 24
A8    NGC 6822                         25.393    -18.378   78       4.74   1.5 × 10^1  92.6 ± 46
A9    Sag trailing arm, south          187.953   19.882    64       4.54   2.9         96.6 ± 48
A10   Fornax dwarf Sph                 238.091   -65.798   39       7.58   7.3         121.3 ± 60
A11   Near mask                        164.086   24.992    79       5.18   3.4         88.2 ± 44
A12   Probably Monoceros ring          317.865   21.908    307      5.40   7.5         21.8 ± 11
A13   Near mask                        143.738   30.936    54       3.93   2.1         22.6 ± 11
A14   Has protrusions to high b        56.9910   -27.865   203      5.23   8.9         97.7 ± 48
A15   Near a strong extinction region  316.906   -29.868   76       4.99   2.7 × 10^1  98.6 ± 49
A16   In Pisces constellation          104.793   -52.535   126      6.25   9.9         102.9 ± 51

a Distance limits computed assuming a scatter of 1.1 mag in distance modulus.

b The actual distance of Andromeda is around 778 kpc, which is much higher than what we have derived using M-giants. This discrepancy arises because the
detected M-giants from Andromeda are very rare and bright giants that do not fall on the color magnitude relationship that we assume for calculating distances.
Fig. 8. Comparison of detected structures with features in the dust infrared
emission maps of Schlegel et al. (1998). The panels show the distribution of
dust extinction (top two panels) and dust color temperature (lower panel) as
a function of Galactic longitude and latitude in the southern hemisphere. The
locations of the structures are marked as circles on the plots. The top two
panels are the same except that in the top panel, stars associated
with the structures in the 2MASS M-giant sample are overplotted. It can be
seen that structures A15 and A16 are associated with features in both the
extinction and temperature maps.

the group-finder, the densest and most prominent are bound
satellite systems such as the Magellanic Clouds (LMC and
SMC^3), and the core of the Sagittarius dwarf galaxy. Unbound
debris in the form of the streams from Sagittarius is traced
clearly by the structures A3, A5 and A9. Galaxies
in the Local Group, like the Andromeda galaxy, NGC 6822
and the Fornax dwarf spheroidal galaxy, which contribute as
few as 30-100 stars to the sample, are also rediscovered.
These findings both demonstrate the success of the
group-finding scheme and lend credibility to the newly discovered
structures.

While our group-finding technique has been successful at
revealing some of the known structures, others are missing:
e.g., the Virgo overdensity (Juric et al. 2008), the Virgo stellar
stream (Vivas et al. 2001; Duffau et al. 2006), the Canis Major
dwarf galaxy (Martin et al. 2004; Martinez-Delgado et al.
2005) and the Hercules Aquila cloud (Belokurov et al. 2007).
The absence of these structures can be understood in terms
of our sample selection criteria: the Canis Major overdensity
is at low latitude (b = -7°.99) and hence outside the region
explored in this paper; the Virgo stellar stream is metal poor
([Fe/H] = -1.86, as suggested by Duffau et al. 2006) and

3 Group A7 is a subgroup embedded within group A2, which is the SMC;
hence we consider A7 a part of the SMC.
Fig. 9. Plot in R.A. and decl. of groups found in the 2MASS M-giant
sample that lie near the SDSS Stripe 82. The filled circles denote the point
of peak density while the squares denote the data points belonging to the
groups. The bottom panel shows all points that can be associated
with the density peak of a structure by tracing the particles along the
direction of density gradients and that have density greater than 0.175 times the peak
density of the structure.

hence is most likely not sampled by the metal rich M-giants;
the Virgo overdensity is close to the Sun (6-20 kpc) and
largely excluded by our selection criterion of K_s > 10.0 (i.e.,
distance greater than about 15.0 kpc); the Hercules Aquila
cloud is also nearby (10-20 kpc) and, moreover, the part of it
in the northern hemisphere is centered at (l, b) = (30°, 20°),
which is outside the region explored by us.

Next we investigate the six newly discovered structures in
Table 2. These could have a real physical association with
satellite remnants or they could be artificial overdensities
created by dust extinction regions, masks, contamination from
disk stars or Poisson noise. For example, structures A11,
A12, A13 and A14 are all at low latitudes and hence
possibly associated with the disk. Structures A11
and A13 both lie right at the edge of one of the rectangular
masks (see Figure 1 and Figure 5), which further
undermines their authenticity. Structure A12 is elongated along
the disk, is nearby (distance of about 23 kpc) and its location
matches that reported by Rocha-Pinto et al. (2003) for the
previously identified Monoceros ring (see also Yanny et al.
2003; Newberg et al. 2002). Structure A14 also originates at
low latitude, but shows a protrusion extending to high latitudes
that suggests it could be a real halo structure.

A comparison of the location of the remaining structures
(A15 and A16) with the Schlegel et al. (1998) infrared dust
emission maps (Figure 8) shows that both are associated with
high extinction and low temperature features in the maps.
The extinction around structure A15 was found to be
particularly high, while that around A16 is only mildly elevated
(E(B - V) ~ 0.114 mag). Additionally, A16 is at high latitude
and not close to any masking region, which makes it
a promising new structure that could correspond to a satellite
accretion event. The close association with a low temperature
feature in the dust could either mean that the structure is an
artifact of the extinction corrections (improperly dereddened
stars getting spuriously included or excluded in our sample)
or that the dust feature is associated with real gas and dust
in the structure itself.

Moreover, a structure at a similar location, named the Pisces
overdensity, has recently been discovered in a sample of
RR Lyrae stars in the SDSS Stripe 82 (Watkins et al. 2009;
Sesar et al. 2007). The overdensity has also been spectroscopically
confirmed by Kollmeier et al. (2009) using a sample of
eight RR Lyraes from SDSS. They speculate that it is a bound
satellite system because the observed velocity dispersion of
five of their stars is small (6 km s^-1), but at the same time
do not rule out the possibility that it is an unbound satellite
system, given the large angular width of the overall structure.

To investigate the correspondence of A16 to the Pisces
overdensity we plot the groups identified in our 2MASS
M-giant sample alongside the SDSS Stripe 82 in the top panel
of Figure 9. Specifically, the Pisces overdensity has been
reported to lie in the interval -25° < R.A. < , with the
peak concentration being at R.A. ~ -5° and at a distance of
r = 79.9 ± 13.9 kpc (as estimated in Watkins et al. 2009). The
M-giants in structure A16 are very close to this peak along the
boundary of the stripe and at a similar distance given the large
uncertainty (r = 103 kpc with a range of ±51 kpc). The
offset in distance could be due either to our arbitrarily adopted
value of metallicity in calculating the distances to our stars or
to a dramatic distance gradient across the field.
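
To judge whether the adopted metallicity could plausibly account for this offset, it helps to express it on the distance modulus scale. The snippet below is a minimal sketch that converts the ratio of the two quoted distances (102.9 kpc for the A16 peak in Table 2 and 79.9 kpc for the Pisces overdensity) into a magnitude difference; the function name is illustrative and not part of the paper.

```python
import math

def distance_modulus_offset(r1_kpc, r2_kpc):
    """Difference in distance modulus between two distances: 5 * log10(r1 / r2)."""
    return 5.0 * math.log10(r1_kpc / r2_kpc)

# A16 density peak (Table 2) versus the Pisces overdensity (Watkins et al. 2009).
offset = distance_modulus_offset(102.9, 79.9)
print(f"offset = {offset:.2f} mag")
```

The offset comes out at about 0.55 mag, which is comparable to the 0.51 mag scatter in distance modulus found for the full halo and attributed mostly to metallicity and age differences between satellite systems.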

Given these similarities, it is striking that the upper panel
of Figure 9 does not show any M-giants from A16 actually in
Stripe 82. The most likely explanation for this lack of stars
is that the number density of the M-giants associated with
A16 is relatively low within the stripe. Indeed, a comparison
of the number of Sagittarius M-giants (group A5) with that of
Sagittarius RR Lyraes in Stripe 82 (Watkins et al. 2009)
suggests the number of M-giants is probably a factor of 3 lower,
implying that the density peak found in RR Lyraes with SDSS
should contain only a few M-giants. Hence, the number
density of A16 could decrease sufficiently toward the stripe that
it is cut off by the default criteria in the group finder itself,
which truncates a group whenever it intersects a neighboring
group. If we instead relax the default truncation criteria to also
include points that converge to the point of peak density in
A16 by following the path along local density gradients (i.e.,
densest nearest neighbor links), we find plausible extensions
to the group. The bottom panel of Figure 9 plots the positions
for this extended group with the minimum density threshold
of points being set to 0.175 times the maximum density within
a group. Nearby groups extended with the same criteria are
also shown alongside. It can be seen that the extended portion
of A16 now matches the distribution of Pisces overdensity RR
Lyraes in Stripe 82.
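
The relaxed membership criterion described above can be illustrated with a toy example. The sketch below is not the group finder used in this paper: it assumes that a density estimate is already available for every point, uses a brute-force neighbor search, and treats the number of neighbors k as a free parameter. All function names are hypothetical. Each point is linked to the densest point among itself and its k nearest neighbors, links are followed until a local density peak is reached, and points below 0.175 times the peak density are discarded.

```python
def densest_neighbor_links(points, density, k=5):
    """Link each point to the densest point among itself and its k nearest neighbors."""
    links = []
    for p in points:
        # Sort all indices by squared distance to p; the sort is stable,
        # so ties in distance are resolved by index order.
        order = sorted(range(len(points)),
                       key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])))
        neighbors = order[: k + 1]  # includes the point itself at distance 0
        links.append(max(neighbors, key=lambda j: density[j]))
    return links

def trace_to_peak(links, i):
    """Follow densest-neighbor links from point i until a local density peak is reached."""
    while links[i] != i:
        i = links[i]
    return i

def extended_group(points, density, peak, k=5, frac=0.175):
    """Indices of points that converge to `peak` and exceed frac * peak density."""
    links = densest_neighbor_links(points, density, k)
    floor = frac * density[peak]
    return [i for i in range(len(points))
            if density[i] >= floor and trace_to_peak(links, i) == peak]

# Toy data: two well separated clumps along one axis, with assumed densities.
points = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0),
          (100, 0, 0), (101, 0, 0), (102, 0, 0)]
density = [1.0, 3.0, 10.0, 3.0, 1.0, 2.0, 8.0, 2.0]

members = extended_group(points, density, peak=2, k=2)
print(members)
```

In this toy case all five points of the first clump converge to the peak at index 2, but the two outermost points fall below the density threshold and are excluded, so only the three central points remain.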

We also note that the metallicity of the Pisces overdensity has
been reported to be [Fe/H] = -1.5 by Watkins et al. (2009),
which means that it would be almost undetectable by
M-giants. At the same time, stars in a satellite system do
span a range of metallicities, and M-giants could very well be
sampling the high metallicity stars in the system.

If A16 and the Pisces overdensity are related, then A16 not
only offers an independent confirmation of the Pisces
overdensity, but also provides an extended view of it. In fact,
our results show that the point of peak density is located at
(R.A., decl.) = (1°.81, 8°.77), which is just outside the range
of Stripe 82 in SDSS (by about 7° in decl.). We estimate
the uncertainty in the angular position of our peak density to
be δθ = sin^-1(r_k/r) = 1°.7 (where r is the distance of
the density peak and r_k is the radius of the sphere enclosing
the fifth nearest neighbor of the densest point), which is smaller than
the angular distance of the peak from the SDSS Stripe 82.
Our results favor an interpretation of an unbound satellite
system or possibly a bound system within a larger overdensity.
Such cloud-like structures are expected to be formed from
satellites disrupting along eccentric orbits, while the classical
rosette tails (such as those of the Sagittarius dwarf galaxy, see
Majewski et al. 2001) arise from objects on more circular
orbits (see Johnston et al. 2008 for a more complete discussion
of characteristic morphologies).
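
The position and angular uncertainty quoted above can be cross-checked against Table 2. The sketch below converts the Galactic coordinates of the A16 density peak to equatorial coordinates using the standard J2000 orientation of the Galactic frame (the three constants are assumed values, not taken from this paper), and evaluates the angular uncertainty formula. Since the fifth-neighbor radius is not quoted in the text, it is inferred here from the stated 1°.7. Function names are illustrative.

```python
import math

# Assumed J2000 orientation of the Galactic frame.
RA_NGP = math.radians(192.85948)   # R.A. of the north Galactic pole
DEC_NGP = math.radians(27.12825)   # decl. of the north Galactic pole
L_NCP = math.radians(122.93192)    # Galactic longitude of the north celestial pole

def galactic_to_equatorial(l_deg, b_deg):
    """Convert Galactic (l, b) in degrees to J2000 (R.A., decl.) in degrees."""
    l, b = math.radians(l_deg), math.radians(b_deg)
    sin_dec = (math.sin(DEC_NGP) * math.sin(b)
               + math.cos(DEC_NGP) * math.cos(b) * math.cos(L_NCP - l))
    y = math.cos(b) * math.sin(L_NCP - l)
    x = (math.cos(DEC_NGP) * math.sin(b)
         - math.sin(DEC_NGP) * math.cos(b) * math.cos(L_NCP - l))
    ra = (RA_NGP + math.atan2(y, x)) % (2.0 * math.pi)
    dec = math.asin(max(-1.0, min(1.0, sin_dec)))
    return math.degrees(ra), math.degrees(dec)

def angular_uncertainty_deg(r_k, r):
    """Angular uncertainty sin^-1(r_k / r), in degrees, for a neighbor radius r_k at distance r."""
    return math.degrees(math.asin(r_k / r))

# Density peak of A16 from Table 2: (l, b) = (104.793, -52.535), r = 102.9 kpc.
ra, dec = galactic_to_equatorial(104.793, -52.535)
print(f"R.A. = {ra:.2f} deg, decl. = {dec:.2f} deg")

# Fifth-neighbor radius implied by the quoted uncertainty of 1.7 deg.
r_k = 102.9 * math.sin(math.radians(1.7))
print(f"implied r_k = {r_k:.2f} kpc")
```

The converted position agrees with the quoted (1°.81, 8°.77), and the quoted uncertainty implies a fifth-neighbor radius of roughly 3 kpc at the distance of the peak.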

Finally, note that a smaller sub-concentration of RR Lyrae
stars, at a median distance of 92 kpc, has also been noted
in the interval -25° to -20° of Stripe 82 (structure L in
Sesar et al. 2007), which seems to coincide in angular position
(see the upper panel of Figure 9) and distance with a
smaller sub-concentration of stars belonging to structure A14
in the M-giant survey (distances estimated to be 95, 80 and 88
kpc for three M-giants lying in that region). Whether A14 and
A16 are truly associated with the structures in RR Lyraes in
Stripe 82 (or with each other) can be tested by mapping their
velocity and spatial structures.

5. SUMMARY 

We have explored the use of a density based hierarchical
clustering algorithm to identify structures in the stellar
halo. Application of the group finder to simulated data
demonstrated that in three-dimensional data sets with large
dispersion in the radial dimension, a coordinate transformation
in which the radial coordinate is in logarithmic units greatly
improves the quality of clustering.

As an application to a real data set we ran the group-finder
on the 2MASS M-giant catalog and identified 16 structures in
it: 10 of these are known structures and six are new. Among
the six new structures, two are probably due to masks
employed on the data, one is associated with a strong extinction
region, and one is probably a part of the Monoceros ring.
Another one originates at low latitude, suggesting contamination
by disk stars, but also shows significant protrusions extending
to high latitudes, implying that it could be a real feature in the
stellar halo.

One structure is free from these defects, has an overdensity
similar to that of known structures like the streams of the
Sagittarius dwarf galaxy and is also slightly above the Poisson
noise. While these properties suggest that it is a genuine
structure, possibly a satellite remnant, the structure was also
found to match a low temperature feature in the dust map. The
correspondence with a feature in the dust map could either mean
that the structure is an artifact of the extinction corrections or
that the dust feature is associated with real gas and dust in
the structure itself.

The position and distance of the detected structure closely
match those of the Pisces overdensity, which has been
recently discovered using RR Lyraes in the SDSS Stripe 82. If
A16 is indeed related to the Pisces overdensity, then our analysis
using 2MASS M-giants provides an independent confirmation
of the overdensity and offers an extended view of it. In
addition, our analysis suggests that the point of peak density
is located just outside the range of the SDSS stripe, which
favors the interpretation that the system is an unbound satellite
system, probably corresponding to debris from a satellite
disrupting along a fairly radial orbit. Deeper photometric
surveys of this region, along with spectroscopic measurements
of the giant stars associated with the overdensity, should help
confirm or rule out this scenario.

Overall we conclude that group finding is a promising
technique to unravel the history of our stellar halo and as a
window on accretion more generally. Clouds of debris like the
Pisces overdensity are naturally found in model stellar halos
built within a standard cosmological context, and are
even predicted to be the dominant structures in the outer halo
(Bullock & Johnston 2005; Johnston et al. 2008). Indeed, if
none were found we would conclude either that we live in
a Galaxy that has suffered an unusual paucity of accretion
events on radial orbits, or that our expectations of orbital
distributions of accreting objects (gleaned from cosmological
simulations of structure formation) are flawed.

Future prospects for group-finding are even brighter: our
analysis here has only used the three-dimensional spatial
distribution of stars, while many surveys also have velocity
(proper motions and radial velocities of stars) and chemical
abundance information. These additional dimensions should
help recover more structures. Moreover, we have here used
M-giants as tracers of the stellar halo. Since M-giants are
metal rich stars, our sample is biased toward
high-metallicity systems that originate from high mass
progenitors and misses the much more numerous low mass
systems that have low metallicity. Hence surveys utilizing a
different tracer population, e.g., main sequence stars or RR
Lyraes, should unravel more structures in the stellar halo.